Abstract

As deep nets are increasingly used in applications suited for mobile
devices, a fundamental dilemma becomes apparent: the trend in deep learning
is to grow models to absorb ever-increasing data set sizes; however, mobile
devices are designed with very little memory and cannot store such large
models. We present a novel network architecture, HashedNets, that exploits
inherent redundancy in neural networks to achieve drastic reductions in
model sizes. HashedNets uses a low-cost hash function to randomly group
connection weights into hash buckets, and all connections within the same
hash bucket share a single parameter value. These parameters are tuned to
adjust to the HashedNets weight sharing architecture with standard backprop
during training. Our hashing procedure introduces no additional memory
overhead, and we demonstrate on several benchmark data sets that HashedNets
shrink the storage requirements of neural networks substantially while
mostly preserving generalization performance.

1. Introduction 

In the past decade deep neural networks have set new performance standards
in many high-impact applications. These include object classification
(Krizhevsky et al., 2012; Sermanet et al., 2013), speech recognition (Hinton
et al., 2012), image caption generation (Vinyals et al., 2014; Karpathy &
Fei-Fei, 2014) and domain adaptation (Glorot et al., 2011b). As data sets
increase in size, so does the number of parameters in these neural networks
in order to absorb the enormous amount of supervision (Coates et al., 2013).
Increasingly, these networks are trained on industrial-sized clusters
(Le, 2013) or high-performance graphics processing units (GPUs) (Coates
et al., 2013).


Simultaneously, there has been a second trend as applications of machine
learning have shifted toward mobile and embedded devices. As examples,
modern smartphones are increasingly operated through speech recognition
(Schuster, 2010), robots and self-driving cars perform object recognition
in real time (Montemerlo et al., 2008), and medical devices collect and
analyze patient data (Lee & Verma, 2013). In contrast to GPUs or computing
clusters, these devices are designed for low power consumption and long
battery life. Most importantly, they typically have small working memory.
For example, even the top-of-the-line iPhone 6 only features a mere 1GB of
RAM.^1

The disjunction between these two trends creates a dilemma when
state-of-the-art deep learning algorithms are designed for deployment on
mobile devices. While it is possible to train deep nets offline on
industrial-sized clusters (server-side), the sheer size of the most
effective models would exceed the available memory, making it prohibitive
to perform testing on-device. In speech recognition, one common cure is to
transmit processed voice recordings to a computation center, where the
voice recognition is performed server-side (Chun & Maniatis, 2009). This
approach is problematic, as it only works when sufficient bandwidth is
available and incurs artificial delays through network traffic (Kosner,
2012). One solution is to train small models for the on-device
classification; however, these tend to significantly impact accuracy (Chun
& Maniatis, 2009), leading to customer frustration.

This dilemma motivates neural network compression. Recent work by Denil
et al. (2013) demonstrates that there is a surprisingly large amount of
redundancy among the weights of neural networks. The authors show that a
small subset of the weights is sufficient to reconstruct the entire
network. They exploit this by training low-rank decompositions of the
weight matrices. Ba & Caruana (2014) show that deep neural networks can be
successfully compressed into "shallow" single-layer neural networks by
training the small network on the (log-) outputs of the fully trained deep
network (Buciluǎ et al., 2006). Courbariaux et al. (2014) train neural
networks with reduced bit precision, and, long predating this work, LeCun
et al. (1989) investigated dropping unimportant weights in neural networks.
In summary, the accumulated evidence suggests that much of the information
stored within network weights may be redundant.

^1 http://en.wikipedia.org/wiki/IPhone_6

In this paper we propose HashedNets, a novel network architecture to reduce
and limit the memory overhead of neural networks. Our approach is
compellingly simple: we use a hash function to group network connections
into hash buckets uniformly at random such that all connections grouped to
the $i$-th hash bucket share the same weight value $w_i$. Our parameter
hashing is akin to prior work in feature hashing (Weinberger et al., 2009;
Shi et al., 2009; Ganchev & Dredze, 2008) and is similarly fast and
requires no additional memory overhead. The backpropagation algorithm
(LeCun et al., 2012) can naturally tune the hash bucket parameters and take
into account the random weight sharing within the neural network
architecture.

We demonstrate on several real-world deep learning benchmark data sets that
HashedNets can drastically reduce the model size of neural networks with
little impact on prediction accuracy. Under the same memory constraint,
HashedNets have more adjustable free parameters than the low-rank
decomposition methods suggested by Denil et al. (2013), leading to smaller
drops in descriptive power.

Similarly, we also show that for a finite set of parameters it is
beneficial to "inflate" the network architecture by reusing each parameter
value multiple times. Best results are achieved when networks are inflated
by a factor of 8-16×. The "inflation" of neural networks with HashedNets
imposes no restrictions on other network architecture design choices, such
as dropout regularization (Srivastava et al., 2014), activation functions
(Glorot et al., 2011a; LeCun et al., 2012), or weight sparsity (Coates
et al., 2011).

2. Feature Hashing 

Learning under memory constraints has previously been explored in the
context of large-scale learning for sparse data sets. Feature hashing (or
the hashing trick) (Weinberger et al., 2009; Shi et al., 2009) is a
technique to map high-dimensional text documents directly into bag-of-word
(Salton & Buckley, 1988) vectors, which would otherwise require use of
memory consuming dictionaries for storage of indices corresponding with
specific input terms.

Formally, an input vector $\mathbf{x} \in \mathbb{R}^d$ is mapped into a
feature space with a mapping function
$\phi: \mathbb{R}^d \to \mathbb{R}^k$ where $k \ll d$. The mapping $\phi$
is based on two (approximately uniform) hash functions
$h: \mathbb{N} \to \{1, \dots, k\}$ and $\xi: \mathbb{N} \to \{-1, +1\}$,
and the $k$-th dimension of the hashed input $\mathbf{x}$ is defined as

$$\phi_k(\mathbf{x}) = \sum_{i:\,h(i)=k} x_i\,\xi(i).$$

The hashing trick leads to large memory savings for two reasons: it can
operate directly on the input term strings and avoids the use of a
dictionary to translate words into vectors; and the parameter vector of a
learning model lives within the much smaller dimensional $\mathbb{R}^k$
instead of $\mathbb{R}^d$. The dimensionality reduction comes at the cost
of collisions, where multiple words are mapped into the same dimension.
This problem is less severe for sparse data sets and can be counteracted
through multiple hashing (Shi et al., 2009) or larger hash tables
(Weinberger et al., 2009).
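
The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the cited work: the function names `bucket`, `sign` and `hash_features` are hypothetical, bucket indices are 0-based, and salted CRC32 stands in for the two hash functions $h$ and $\xi$ (the two salted hashes are only assumed to behave approximately independently).

```python
import zlib


def bucket(token: str, k: int) -> int:
    """Hash h: token -> bucket index in {0, ..., k-1} (0-based here)."""
    return zlib.crc32(("h:" + token).encode("utf-8")) % k


def sign(token: str) -> int:
    """Hash xi: token -> {-1, +1}, salted differently from `bucket`."""
    return 1 if zlib.crc32(("xi:" + token).encode("utf-8")) % 2 == 0 else -1


def hash_features(tokens, k):
    """Map a token sequence to a k-dimensional hashed bag-of-words vector.

    Each occurrence of token i adds xi(i) to entry h(i), so every entry holds
    the sum of x_i * xi(i) over the tokens hashed to it, where x_i is the
    count of token i. No dictionary from tokens to indices is stored.
    """
    phi = [0.0] * k
    for token in tokens:
        phi[bucket(token, k)] += sign(token)
    return phi


doc = "the quick brown fox jumps over the lazy dog".split()
phi = hash_features(doc, k=8)
```

The vector length is fixed by `k` regardless of vocabulary size, which is where the memory saving comes from; distinct tokens that share a bucket collide and are summed with their signs.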

In addition to memory savings, the hashing trick has the appealing property
of being sparsity preserving, fast to compute and storage-free. The most
important property of the hashing trick is, arguably, its (approximate)
preservation of inner product operations. The second hash function, $\xi$,
guarantees that inner products are unbiased in expectation (Weinberger
et al., 2009); that is,

$$\mathbb{E}\left[\phi(\mathbf{x})^\top \phi(\mathbf{x}')\right]_\phi = \mathbf{x}^\top \mathbf{x}'. \tag{1}$$
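
The unbiasedness property in Eq. (1) can be checked numerically. The sketch below is an illustration under stated assumptions: it draws the bucket and sign assignments from a pseudo-random generator instead of a real hash function, uses 0-based buckets, and picks the sizes `d`, `k` and `trials` arbitrarily. Averaged over many random draws of $(h, \xi)$, the hashed inner product should approach the exact one.

```python
import numpy as np


def hashed(x, h, xi, k):
    """phi_k(x) = sum over {i : h(i) = k} of x_i * xi(i), via a weighted bincount."""
    return np.bincount(h, weights=xi * x, minlength=k)


rng = np.random.default_rng(0)
d, k, trials = 20, 8, 20000          # illustrative sizes, not from the paper
x, x_prime = rng.normal(size=d), rng.normal(size=d)

total = 0.0
for _ in range(trials):
    h = rng.integers(0, k, size=d)         # random bucket for every input index
    xi = rng.choice([-1.0, 1.0], size=d)   # random sign for every input index
    total += hashed(x, h, xi, k) @ hashed(x_prime, h, xi, k)

estimate = total / trials   # Monte Carlo estimate of the left-hand side of Eq. (1)
exact = x @ x_prime         # right-hand side of Eq. (1)
```

Any single draw of the hash functions can be far from `exact` because of collisions; only the expectation matches, which is why the sign hash matters.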

Finally, Weinberger et al. (2009) also show that the hashing trick can be
used to learn multiple classifiers within the same hashed space. In
particular, the authors use it for multi-task learning and define multiple
hash functions $\phi_1, \dots, \phi_T$, one for each task, that map inputs
for their respective tasks into one joint space. Let
$\mathbf{w}_1, \dots, \mathbf{w}_T$ denote the weight vectors of the
respective learning tasks; then if $t' \neq t$, a classifier for task $t$
does not interfere with a hashed input for task $t'$, i.e.
$\mathbf{w}_t^\top \phi_{t'}(\mathbf{x}) \approx 0$.

3. Notation 

Throughout this paper we type vectors in bold ($\mathbf{x}$), scalars in
regular ($C$ or $b$) and matrices in capital bold ($\mathbf{X}$). Specific
entries in vectors or matrices are scalars and follow the corresponding
convention, i.e. the $i$-th dimension of vector $\mathbf{x}$ is $x_i$ and
the $(i,j)$-th entry of matrix $\mathbf{V}$ is $V_{ij}$.

Feed Forward Neural Networks. We define the forward propagation of the
$\ell$-th layer in a neural network as

$$a_i^{\ell+1} = f\left(z_i^{\ell+1}\right), \quad \text{where } z_i^{\ell+1} = \sum_{j=0}^{n^\ell} V_{ij}^{\ell}\, a_j^{\ell}, \tag{2}$$

where $\mathbf{V}^\ell$ is the (virtual) weight matrix in the $\ell$-th
layer. The vectors $\mathbf{z}^\ell, \mathbf{a}^\ell \in \mathbb{R}^{n^\ell}$
denote the activation units before and after transformation through the
transition function $f(\cdot)$. Typical activation functions are the
rectified linear unit (ReLU) (Nair & Hinton, 2010), sigmoid or tanh (LeCun
et al., 2012).

4. HashedNets

In this section we present HashedNets, a novel variation of neural networks
with drastically reduced model sizes (and memory demands). We first
introduce our approach as a method of random weight sharing across the
network connections and then describe how to facilitate it with the hashing
trick to avoid any additional memory overhead.

4.1. Random weight sharing 

In a standard fully-connected neural network, there are
$(n^\ell + 1) \times n^{\ell+1}$ weighted connections between a pair of
layers, each with a corresponding free parameter in the weight matrix
$\mathbf{V}^\ell$. We assume a finite memory budget per layer,
$K^\ell \ll (n^\ell + 1) \times n^{\ell+1}$, that cannot be exceeded. The
obvious solution is to fit the neural network within budget by reducing the
number of nodes $n^\ell$, $n^{\ell+1}$ in layers $\ell$, $\ell + 1$ or by
reducing the bit precision of the weight matrices (Courbariaux et al.,
2014). However, if $K^\ell$ is sufficiently small, both approaches
significantly reduce the ability of the neural network to generalize (see
Section 6). Instead, we propose an alternative: we keep the size of
$\mathbf{V}^\ell$ untouched but reduce its effective memory footprint
through weight sharing. We only allow exactly $K^\ell$ different weights to
occur within $\mathbf{V}^\ell$, which we store in a weight vector
$\mathbf{w}^\ell \in \mathbb{R}^{K^\ell}$. The weights within
$\mathbf{w}^\ell$ are shared across multiple randomly chosen connections
within $\mathbf{V}^\ell$. We refer to the resulting matrix
$\mathbf{V}^\ell$ as virtual, as its size could be increased (i.e. nodes
are added to the hidden layer) without increasing the actual number of
parameters of the neural network.

Figure 1 shows a neural network with one hidden layer, four input units and
two output units. Connections are randomly grouped into three categories
per layer and their weights are shown in the virtual weight matrices
$\mathbf{V}^1$ and $\mathbf{V}^2$. Connections belonging to the same color
share the same weight value, which are stored in $\mathbf{w}^1$ and
$\mathbf{w}^2$, respectively. Overall, the entire network is compressed by
a factor 1/4, i.e. the 24 weights stored in the virtual matrices
$\mathbf{V}^1$ and $\mathbf{V}^2$ are reduced to only six real values in
$\mathbf{w}^1$ and $\mathbf{w}^2$. On data with four input dimensions and
two output dimensions, a conventional neural network with six weights would
be restricted to a single (trivial) hidden unit.

4.2. Hashed Neural Nets (HashedNets) 

A naive implementation of random weight sharing can be trivially achieved
by maintaining a secondary matrix consisting of each connection's group
assignment. Unfortunately, this explicit representation places an
undesirable limit on potential memory savings.

Figure 1. An illustration of a neural network with random weight sharing
under compression factor 1/4. The 16 + 8 = 24 virtual weights are
compressed into 6 real weights. The colors represent matrix elements that
share the same weight value.

We propose to implement the random weight sharing assignments using the
hashing trick. In this way, the shared weight of each connection is
determined by a hash function that requires no storage cost with the model.
Specifically, we assign to $V_{ij}^\ell$ an element of $\mathbf{w}^\ell$
indexed by a hash function $h^\ell(i,j)$, as follows:

$$V_{ij}^{\ell} = w^{\ell}_{h^\ell(i,j)}, \tag{3}$$

where the (approximately uniform) hash function $h^\ell(\cdot,\cdot)$ maps
a key $(i,j)$ to a natural number within $\{1, \dots, K^\ell\}$. In the
example of Figure 1, $h^1(2,1) = 1$ and therefore
$V^1_{2,1} = w^1_1 = 3.2$. For our experiments we use the open-source
implementation xxHash.^2
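
The index mapping of Eq. (3) can be sketched as follows. This is a minimal illustration under assumptions: CRC32 from the Python standard library stands in for xxHash, bucket indices are 0-based, the helper names `h` and `virtual_matrix` are hypothetical, and apart from 3.2 (taken from the Figure 1 example) the weight values are made up.

```python
import zlib

import numpy as np


def h(i: int, j: int, K: int, layer: int = 0) -> int:
    """Map connection (i, j) of a given layer to a bucket in {0, ..., K-1}."""
    return zlib.crc32(f"{layer}:{i}:{j}".encode("utf-8")) % K


def virtual_matrix(w: np.ndarray, n_out: int, n_in: int, layer: int = 0) -> np.ndarray:
    """Materialize V with V[i, j] = w[h(i, j)].

    Only w needs to be stored; V is rebuilt here purely for inspection.
    """
    K = len(w)
    V = np.empty((n_out, n_in))
    for i in range(n_out):
        for j in range(n_in):
            V[i, j] = w[h(i, j, K, layer)]
    return V


w = np.array([3.2, -0.7, 1.1])           # K = 3 real weights
V = virtual_matrix(w, n_out=4, n_in=4)   # 16 virtual weights backed by 3 values
```

Because the bucket of a connection is recomputed from `(i, j)` whenever it is needed, no assignment table has to be kept alongside `w`.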

4.3. Feature hashing versus weight sharing 

This section focuses on a single layer throughout and to simplify notation
we will drop the superscripts $\ell$. We will denote the input activation
as $\mathbf{a} = \mathbf{a}^\ell \in \mathbb{R}^m$ of dimensionality
$m = n^\ell$. We denote the output as
$\mathbf{z} = \mathbf{z}^{\ell+1} \in \mathbb{R}^n$ with dimensionality
$n = n^{\ell+1}$.

To facilitate weight sharing within a feed forward neural network, we can
simply substitute Eq. (3) into Eq. (2):

$$z_i = \sum_{j=1}^{m} V_{ij}\, a_j = \sum_{j=1}^{m} w_{h(i,j)}\, a_j. \tag{4}$$

Alternatively and more in line with previous work (Weinberger et al.,
2009), we may interpret HashedNets in terms of feature hashing. To compute
$z_i$, we first hash the activations from the previous layer, $\mathbf{a}$,
with the hash mapping function
$\phi_i(\cdot): \mathbb{R}^m \to \mathbb{R}^K$. We then compute the inner
product between the hashed representation $\phi_i(\mathbf{a})$ and the
parameter vector $\mathbf{w}$,

$$z_i = \mathbf{w}^\top \phi_i(\mathbf{a}). \tag{5}$$

^2 https://code.google.com/p/xxhash/

Both $\mathbf{w}$ and $\phi_i(\mathbf{a})$ are $K$-dimensional, where $K$
is the number of hash buckets in this layer. The hash mapping function
$\phi_i$ is defined as follows. The $k$-th element of $\phi_i(\mathbf{a})$,
i.e. $[\phi_i(\mathbf{a})]_k$, is the sum of variables hashed into bucket
$k$:

$$[\phi_i(\mathbf{a})]_k = \sum_{j:\,h(i,j)=k} a_j. \tag{6}$$

Starting from Eq. (5), we show that the two interpretations (Eq. (4) and
(5)) are equivalent:

$$z_i = \sum_{k=1}^{K} w_k\, [\phi_i(\mathbf{a})]_k = \sum_{k=1}^{K} w_k \sum_{j:\,h(i,j)=k} a_j = \sum_{j=1}^{m} \sum_{k=1}^{K} w_k\, a_j\, \delta_{[h(i,j)=k]} = \sum_{j=1}^{m} w_{h(i,j)}\, a_j.$$

The final term is equivalent to Eq. (4).
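
The equivalence derived above can also be checked numerically. The sketch below computes $z_i$ once by indexing the shared weights, as in Eq. (4), and once by first hashing the activations into $K$ buckets and then taking an inner product, as in Eq. (5). CRC32 is an assumed stand-in for the hash function, buckets are 0-based, and the function names and sizes are illustrative only.

```python
import zlib

import numpy as np


def h(i, j, K):
    """Assumed stand-in hash: connection (i, j) -> bucket in {0, ..., K-1}."""
    return zlib.crc32(f"{i}:{j}".encode("utf-8")) % K


def z_weight_sharing(w, a, i):
    """Eq. (4): z_i = sum_j w[h(i, j)] * a_j."""
    K = len(w)
    return sum(w[h(i, j, K)] * a[j] for j in range(len(a)))


def z_feature_hashing(w, a, i):
    """Eq. (5): z_i = w^T phi_i(a), where phi_i(a)[k] sums the a_j hashed to k."""
    K = len(w)
    phi = np.zeros(K)
    for j in range(len(a)):
        phi[h(i, j, K)] += a[j]
    return w @ phi


rng = np.random.default_rng(0)
w, a = rng.normal(size=5), rng.normal(size=12)   # K = 5 buckets, m = 12 inputs
pairs = [(z_weight_sharing(w, a, i), z_feature_hashing(w, a, i)) for i in range(7)]
```

Both routes touch every input activation once per output unit; they differ only in whether the grouping is applied to the weights or to the activations.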

Sign factor. With this equivalence between random weight sharing and
feature hashing on input activations, HashedNets inherit several beneficial
properties of feature hashing. Weinberger et al. (2009) introduce an
additional sign factor $\xi(i)$ to remove the bias of hashed inner products
due to collisions. For the same reasons we multiply (3) by the sign factor
$\xi(i,j)$ for parameterizing $\mathbf{V}$ (Weinberger et al., 2009):

$$V_{ij} = w_{h(i,j)}\, \xi(i,j), \tag{7}$$

where $\xi(i,j): \mathbb{N} \to \pm 1$ is a second hash function
independent of $h$. Incorporating $\xi(i,j)$ into feature hashing and
weight sharing does not change the equivalence between them, as the proof
in the previous section still holds with the sign term (details omitted for
improved readability).

Sparsity. As pointed out in Shi et al. (2009) and Weinberger et al. (2009),
feature hashing is most effective on sparse feature vectors since the
number of hash collisions is minimized. We can encourage this effect in the
hidden layers with sparsity inducing transition functions, e.g. rectified
linear units (ReLU) (Glorot et al., 2011a) or through specialized
regularization (Chen et al., 2014; Boureau et al., 2008). In our
implementation, we use ReLU transition functions throughout, as they have
also been shown to often result in superior generalization performance in
addition to their sparsity inducing properties (Glorot et al., 2011a).

Alternative neural network architectures. While this work focuses on
general, fully connected feed forward neural networks, the technique of
HashedNets could naturally be extended to other kinds of neural networks,
such as recurrent neural networks (Pineda, 1987) or others (Bishop, 1995).
It can also be used in conjunction with other approaches for neural network
compression. All weights can be stored with low bit precision (Courbariaux
et al., 2014; Gupta et al., 2015), edges could be removed (Cireşan et al.,
2011) and HashedNets can be trained on the outputs of larger networks (Ba &
Caruana, 2014), yielding further reductions in memory requirements.

4.4. Training HashedNets

Training HashedNets is equivalent to training a standard neural network
with equality constraints for weight sharing. Here, we show how to (a)
compute the output of a hash layer during the feed-forward phase, (b)
propagate gradients from the output layer back to the input layer, and (c)
compute the gradient over the shared weights during the back propagation
phase. We use dedicated hash functions between layers $\ell$ and
$\ell + 1$, and denote them as $h^\ell$ and $\xi^\ell$.

Output. Adding the hash functions $h^\ell(\cdot,\cdot)$ and
$\xi^\ell(\cdot)$ and the weight vectors $\mathbf{w}^\ell$ into the feed
forward update (2) results in the following forward propagation rule:

$$a_i^{\ell+1} = f\left( \sum_{j=0}^{n^\ell} w^{\ell}_{h^\ell(i,j)}\, \xi^\ell(i,j)\, a_j^{\ell} \right). \tag{8}$$

Error term. Let $\mathcal{L}$ denote the loss function for training the
neural network, e.g. cross entropy or the quadratic loss (Bishop, 1995).
Further, let $\delta_j^\ell$ denote the gradient of $\mathcal{L}$ over
activation $j$ in layer $\ell$, also known as the error term. Without
shared weights, the error term can be expressed as
$\delta_j^\ell = \left( \sum_{i=1}^{n^{\ell+1}} V_{ij}^{\ell}\, \delta_i^{\ell+1} \right) f'(z_j^\ell)$,
where $f'(\cdot)$ represents the first derivative of the transition
function $f(\cdot)$. If we substitute Eq. (7) into the error term we
obtain:

$$\delta_j^{\ell} = \left( \sum_{i=1}^{n^{\ell+1}} \xi^\ell(i,j)\, w^{\ell}_{h^\ell(i,j)}\, \delta_i^{\ell+1} \right) f'(z_j^{\ell}). \tag{9}$$

Gradient over parameters. To compute the gradient of $\mathcal{L}$ with
respect to a weight $w_k^\ell$ we need the two gradients,

$$\frac{\partial \mathcal{L}}{\partial V_{ij}^{\ell}} = a_j^{\ell}\, \delta_i^{\ell+1} \quad \text{and} \quad \frac{\partial V_{ij}^{\ell}}{\partial w_k^{\ell}} = \xi^\ell(i,j)\, \delta_{[h^\ell(i,j)=k]}. \tag{10}$$

Here, the first gradient is the standard gradient of the loss with respect
to a (virtual) weight and the second gradient ties the virtual weight
matrix to the actual weights through the hashed map. Combining these two,
we obtain

$$\frac{\partial \mathcal{L}}{\partial w_k^{\ell}} = \sum_{i,j} \frac{\partial \mathcal{L}}{\partial V_{ij}^{\ell}}\, \frac{\partial V_{ij}^{\ell}}{\partial w_k^{\ell}} \tag{11}$$

$$= \sum_{i=1}^{n^{\ell+1}} \sum_{j} a_j^{\ell}\, \delta_i^{\ell+1}\, \xi^\ell(i,j)\, \delta_{[h^\ell(i,j)=k]}. \tag{12}$$
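
The forward rule of Eq. (8) and the shared-weight gradient of Eq. (12) can be sketched for a single layer and verified with a numerical gradient check. This is an illustration under assumptions, not the authors' Torch7 code: salted CRC32 stands in for the two hash functions, buckets are 0-based, tanh and the quadratic loss are chosen for simplicity, the bias unit is omitted, and the index arrays `H` and `S` are materialized only for readability (a memory-constrained implementation would evaluate the hashes on the fly).

```python
import zlib

import numpy as np


def build_maps(n_out, n_in, K):
    """Precompute bucket indices H[i, j] = h(i, j) and signs S[i, j] = xi(i, j)."""
    H = np.empty((n_out, n_in), dtype=np.int64)
    S = np.empty((n_out, n_in))
    for i in range(n_out):
        for j in range(n_in):
            H[i, j] = zlib.crc32(f"h:{i}:{j}".encode("utf-8")) % K
            even = zlib.crc32(f"xi:{i}:{j}".encode("utf-8")) % 2 == 0
            S[i, j] = 1.0 if even else -1.0
    return H, S


def forward(w, a, H, S):
    """Eq. (8) with f = tanh: a_out_i = f(sum_j w[h(i,j)] * xi(i,j) * a_j)."""
    V = w[H] * S                       # virtual weight matrix, Eq. (7)
    return np.tanh(V @ a)


def loss(w, a, y, H, S):
    """Quadratic loss on the layer output."""
    return 0.5 * np.sum((forward(w, a, H, S) - y) ** 2)


def grad_w(w, a, y, H, S):
    """Eq. (12): dL/dw_k = sum_{i,j} a_j * delta_i * xi(i,j) * [h(i,j) = k]."""
    out = forward(w, a, H, S)
    delta = (out - y) * (1.0 - out ** 2)       # dL/dz_i for quadratic loss + tanh
    g = np.zeros_like(w)
    np.add.at(g, H, np.outer(delta, a) * S)    # scatter-add into shared buckets
    return g


rng = np.random.default_rng(0)
n_in, n_out, K = 6, 4, 5                       # 24 virtual weights, 5 real ones
H, S = build_maps(n_out, n_in, K)
w, a, y = rng.normal(size=K), rng.normal(size=n_in), rng.normal(size=n_out)

analytic = grad_w(w, a, y, H, S)
eps = 1e-6
numeric = np.array([
    (loss(w + eps * e, a, y, H, S) - loss(w - eps * e, a, y, H, S)) / (2 * eps)
    for e in np.eye(K)
])
```

The scatter-add is the only step that differs from ordinary backpropagation: every virtual connection contributes its gradient to the single real weight it is hashed to.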


5. Related Work 

Deep neural networks have achieved great progress on a wide variety of
real-world applications, including image classification (Krizhevsky et al.,
2012; Donahue et al., 2013; Sermanet et al., 2013; Zeiler & Fergus, 2014),
object detection (Girshick et al., 2014; Vinyals et al., 2014), image
retrieval (Razavian et al., 2014), speech recognition (Hinton et al., 2012;
Graves et al., 2013; Mohamed et al., 2011), and text representation
(Mikolov et al., 2013).

There have been several previous attempts to reduce the complexity of
neural networks under a variety of contexts. Arguably the most popular
method is the widely used convolutional neural network (Simard et al.,
2003). In the convolutional layers, the same filter is applied to every
receptive field, both reducing model size and improving generalization
performance. The incorporation of pooling layers (Zeiler & Fergus, 2013)
can reduce the number of connections between layers in domains exhibiting
locality among input features, such as images. Autoencoders (Glorot et al.,
2011b) share the notion of tied weights by using the same weights for the
encoder and decoder (up to transpose).

Other methods have been proposed explicitly to reduce the number of free
parameters in neural networks, but not necessarily for reducing memory
overhead. Nowlan & Hinton (1992) introduce soft weight sharing for
regularization in which the distribution of weight values is modeled as a
Gaussian mixture. The weights are clustered such that weights in the same
group have similar values. Since weight values are unknown before training,
weights are clustered during training. This approach is fundamentally
different from HashedNets since it requires auxiliary parameters to record
the group membership for every weight.

Instead of sharing weights, LeCun et al. (1989) introduce "optimal brain
damage" to directly drop unimportant weights. This approach requires
auxiliary parameters for storing the sparse weights and needs retraining
time to fine-tune the resulting architecture. Cireşan et al. (2011)
demonstrate in their experiments that randomly removing connections leads
to superior empirical performance, which is in the same spirit as
HashedNets.


Courbariaux et al. (2014) and Gupta et al. (2015) learn networks with
reduced numerical precision for storing model parameters (e.g. 16-bit
fixed-point representation (Gupta et al., 2015) for a compression factor of
1/4 over double-precision floating point). Experiments indicate little
reduction in accuracy compared with models trained with double-precision
floating point representation. These methods can be readily incorporated
with HashedNets, potentially yielding further reduction in model storage
size.

A recent study by Denil et al. (2013) demonstrates significant redundancy
in neural network parameters by directly learning a low-rank decomposition
of the weight matrix within each layer. They demonstrate that networks
composed of weights recovered from the learned decompositions are only
slightly less accurate than networks with all weights as free parameters,
indicating heavy over-parametrization in full weight matrices. A follow-up
work by Denton et al. (2014) uses a similar technique to speed up test-time
evaluation of convolutional neural networks. The focus of this line of work
is not on reducing storage and memory overhead, but evaluation speed during
test time. HashedNets is complementary to this research, and the two
approaches could be used in combination.

Following the line of model compression, Buciluǎ et al. (2006), Hinton
et al. (2014) and Ba & Caruana (2014) recently introduce approaches to
learn a "distilled" model, training a more compact neural network to
reproduce the output of a larger network. Specifically, Hinton et al.
(2014) and Ba & Caruana (2014) train a large network on the original
training labels, then learn a much smaller "distilled" model on a weighted
combination of the original labels and the (softened) softmax output of the
larger model. The authors show that the distilled model has better
generalization ability than a model trained on just the labels. In our
experimental results, we show that our approach is complementary by
learning HashedNets with soft targets. Rippel et al. (2014) propose a novel
dropout method, nested dropout, to give an order of importance for hidden
neurons. Hypothetically, less important hidden neurons could be removed
after training, a method orthogonal to HashedNets.

Ganchev & Dredze (2008) are among the first to recognize the need to reduce
the size of natural language processing models to accommodate mobile
platforms with limited memory and computing power. They propose random
feature mixing to group features at random based on a hash function, which
dramatically reduces both the number of features and the number of
parameters. With the help of feature hashing (Weinberger et al., 2009),
Vowpal Wabbit, a large-scale learning system, is able to scale to
terafeature datasets (Agarwal et al., 2014).

6. Experimental Results

We conduct extensive experiments to evaluate HashedNets on eight benchmark
datasets. For full reproducibility, our code is available at
http://www.weinbergerweb.com.

Figure 2. Test error rates under varying compression factors with 3-layer
networks on MNIST (left) and ROT (right). Panels: "Dataset: MNIST,
Layers=3, Hidden Units=1000" and "Dataset: ROT, Layers=3, Hidden
Units=1000". x-axis: Compression Factor. Legend: Low-Rank Decomposition
(LRD), Random Edge Removal (RER), Neural Network, Equiv. Size (NN), Dark
Knowledge (DK), HashNet, HashNetDK.

Figure 3. Test error rates under varying compression factors with 5-layer
networks on MNIST (left) and ROT (right). Panels: "Dataset: MNIST,
Layers=5, Hidden Units=1000" and "Dataset: ROT, Layers=5, Hidden
Units=1000".

Datasets. Datasets consist of the original MNIST handwritten digit dataset,
along with four challenging variants (Larochelle et al., 2007). Each
variation amends the original through digit rotation (ROT), background
superimposition (BG-RAND and BG-IMG), or a combination thereof
(BG-IMG-ROT). In addition, we include two binary image classification
datasets: CONVEX and RECT (Larochelle et al., 2007). All data sets have
pre-specified training and testing splits. Original MNIST has splits of
sizes n = 60000 (training) and n = 10000 (testing). CONVEX, RECT, and each
MNIST variation have n = 12000 (training) and n = 50000 (testing).

Baselines and method. We compare HashedNets with several existing
techniques for size-constrained, feed-forward neural networks. Random Edge
Removal (RER) (Cireşan et al., 2011) reduces the total number of model
parameters by randomly removing weights prior to training. Low-Rank
Decomposition (LRD) (Denil et al., 2013) decomposes the weight matrix into
two low-rank matrices. One of these component matrices is fixed while the
other is learned. Elements of the fixed matrix are generated according to a
zero-mean Gaussian distribution with standard deviation
$1/\sqrt{n^\ell}$, with $n^\ell$ inputs to the layer.

Each model is compared against a standard neural network with an equivalent
number of stored parameters, Neural Network (Equivalent-Size) (NN). For
example, for a network with a single hidden layer of 1000 units and a
storage compression factor of 1/10, we adopt a size-equivalent baseline
with a single hidden layer of 100 units. For deeper networks, all hidden
layers are shrunk at the same rate until the number of stored parameters
equals the target size. In a similar manner, we examine Dark Knowledge (DK)
(Hinton et al., 2014; Ba & Caruana, 2014) by training a distilled model to
optimize the cross entropy with both the original labels and soft targets
generated by the corresponding full neural network (compression factor 1).
The distilled model structure is chosen to be the same as the
"equivalent-sized" network (NN) at the corresponding compression rate.
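
The sizing of the equivalent-size baseline can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, every layer is counted with a bias unit as $(n^\ell + 1) \times n^{\ell+1}$, MNIST-like dimensions of 784 inputs and 10 outputs are assumed, and the hidden width is chosen as the one whose parameter count is closest to the target (the text only says that layers are shrunk until the target size is reached).

```python
def n_params(n_in, n_out, hidden, n_hidden_layers):
    """Stored parameters of a fully connected net, counting one bias per layer input."""
    sizes = [n_in] + [hidden] * n_hidden_layers + [n_out]
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


def equivalent_hidden_units(n_in, n_out, n_hidden_layers, full_hidden, compression):
    """Hidden width whose parameter count is closest to compression * full size."""
    target = compression * n_params(n_in, n_out, full_hidden, n_hidden_layers)
    return min(
        range(1, full_hidden + 1),
        key=lambda n: abs(n_params(n_in, n_out, n, n_hidden_layers) - target),
    )


# One hidden layer of 1000 units at compression 1/10, as in the text's example.
one_hidden = equivalent_hidden_units(784, 10, 1, 1000, 1 / 10)
# Three hidden layers: parameters grow quadratically in the width, so the
# width shrinks more slowly than the compression factor.
three_hidden = equivalent_hidden_units(784, 10, 3, 1000, 1 / 8)
```

With a single hidden layer the parameter count is almost linear in the width, which is why a factor of 1/10 corresponds to 100 hidden units.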

Finally, we examine our method under two settings: learning hashed weights
with the original training labels (HashNet) and with combined labels and DK
soft targets (HashNetDK). In all cases, memory and storage consumption is
defined strictly in terms of free parameters. As such, we count the fixed
low rank matrix in the Low-Rank Decomposition method as taking no memory or
storage (providing this baseline a slight advantage).

                                3 Layers                      |                    5 Layers
Dataset         RER    LRD     NN     DK  HashNet  HashNetDK  |    RER    LRD     NN     DK  HashNet  HashNetDK
MNIST          2.19   1.89   1.69   1.71     1.45       1.43  |   1.24   1.77   1.35   1.26     1.22       1.29
BASIC          3.29   3.73   3.19   3.18     2.91       2.89  |   2.87   3.54   2.73   2.87     2.62       2.85
ROT           14.42  13.41  12.65  11.93    11.17      10.34  |   9.89  11.98   9.61   9.46     8.87       8.61
BG-RAND       18.16  45.12  13.00  12.41    13.38      12.27  |  11.31  45.02  11.19  10.91    10.76      10.96
BG-IMG        24.18  38.83  20.93  19.31    22.57      18.92  |  19.81  35.06  19.33  18.94    19.07      18.49
BG-IMG-ROT    59.29  67.00  52.90  53.01    51.96      50.05  |  45.67  64.28  48.47  48.22    46.67      46.78
RECT          27.32  32.73  23.91  24.74    27.06      22.93  |  27.13  35.79  24.58  23.86    29.58      25.99
CONVEX         3.69   4.56   4.24   3.07     3.23       2.96  |   3.92   7.09   3.43   2.37     3.92       2.36

Table 1. Test error rates (in %) with a compression factor of 1/8 across
all data sets. Best results are printed in blue.

                                3 Layers                      |                    5 Layers
Dataset         RER    LRD     NN     DK  HashNet  HashNetDK  |    RER    LRD     NN     DK  HashNet  HashNetDK
MNIST         15.03  28.99   6.28   6.32     2.79       2.65  |   3.20  28.11   2.69   2.16     1.99       1.92
BASIC         13.95  26.95   7.67   8.44     4.17       3.79  |   5.31  27.21   4.55   4.07     3.49       3.19
ROT           49.20  52.18  35.60  35.94    18.04      17.62  |  25.87  52.03  16.16  15.30    12.38      11.67
BG-RAND       44.90  76.21  43.04  53.05    21.50      20.32  |  90.28  76.21  16.60  14.57    16.37      13.76
BG-IMG        44.34  71.27  32.64  41.75    26.41      26.17  |  55.76  70.85  22.77  23.59    22.22      20.01
BG-IMG-ROT    73.17  80.63  79.03  77.40    59.20      58.25  |  88.88  80.93  53.18  53.19    51.93      54.51
RECT          37.22  39.93  34.37  31.85    31.77      30.43  |  50.00  39.65  29.76  26.95    29.70      32.04
CONVEX        18.23  23.67   5.68   5.78     3.67       3.37  |  50.03  23.95   4.28   3.10     5.67       2.64

Table 2. Test error rates (in %) with a compression factor of 1/64 across
all data sets. Best results are printed in blue.

Experimental setting. HashedNets and all accompanying baselines were
implemented using Torch7 (Collobert et al., 2011) and run on NVIDIA GTX
TITAN graphics cards with 2688 cores and 6GB of global memory. We use
32-bit precision throughout but note that the compression rates of all
methods may be improved with lower precision (Courbariaux et al., 2014;
Gupta et al., 2015). We verify all implementations by numerical gradient
checking. Models are trained via stochastic gradient descent (mini-batch
size of 50) with dropout and momentum. ReLU is adopted as the activation
function for all models. Hyperparameters are selected for all algorithms
with Bayesian optimization (Snoek et al., 2012) and hand tuning on 20%
validation splits of the training sets. We use the open source Bayesian
Optimization MATLAB implementation "bayesopt.m" from Gardner et al.
(2014).^3

Results with varying compression. Figures 2 and 3 show the performance of
all methods on MNIST and the ROT variant with different compression factors
on 3-layer (1 hidden layer) and 5-layer (3 hidden layers) neural networks,
respectively. Each hidden layer contains 1000 hidden units. The x-axis in
each figure denotes the fractional compression factor. For HashedNets and
the low rank decomposition and random edge removal compression baselines,
this means we fix the number of hidden units ($n^\ell$) and vary the
storage budget ($K^\ell$) for the weights ($\mathbf{w}^\ell$).

^3 http://tinyurl.com/bayesopt

We make several observations. The accuracy of HashNet and HashNetDK
exceeds that of all other baseline methods, especially in the most
interesting case when the compression factor is small (i.e. very small
models). Both compression baseline algorithms, low rank decomposition and
random edge removal, tend to not outperform a standard neural network with
fewer hidden nodes (black line), trained with dropout. For smaller
compression factors, random edge removal likely suffers due to a
significant number of nodes being entirely disconnected from neighboring
layers. The size-matched NN is consistently the best performing baseline;
however, its test error is significantly higher than that of HashNet,
especially at small compression rates. The use of Dark Knowledge training
improves the performance of HashedNets and the standard neural network. Of
all methods, only HashNet and HashNetDK maintain performance for small
compression factors.

For completeness, we show the performance of all methods on all eight
datasets in Table 1 for compression factor 1/8 and Table 2 for compression
factor 1/64. HashNet and HashNetDK outperform other baselines in most
cases, especially when the compression factor is very small (Table 2). With
a compression factor of 1/64, on average only 0.5 bits of information are
stored per (virtual) parameter (32 bits per stored weight × 1/64).

Figure 4. Test error rates with fixed storage but varying expansion factors on MNIST with 3 layers (left) and 5 layers (right). Panel titles: "Dataset: mnist, Layers=3, Hidden Units=50" (left); "Dataset: mnist, Layers=5, Hidden Units=50" (right). 

Results with fixed storage. We also experiment with the 
setting where the model size is fixed and the virtual network 
architecture is “inflated”. Essentially we are fixing $K^\ell$ (the 
number of “real” weights in $\mathbf{w}^\ell$) and varying the number of 
hidden nodes ($n^\ell$). An expansion factor of 1 denotes the 
case where every virtual weight has a corresponding “real” 
weight, $(n^\ell + 1)\,n^{\ell+1} = K^\ell$. Figure 4 shows the test error 
rate under various expansion rates of a network with one 
hidden layer (left) and three hidden layers (right). In both 
scenarios we fix the number of real weights to the size of 
a standard fully-connected neural network with 50 hidden 
units in each hidden layer, whose test error is shown by the 
black dashed line. 
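
To make the fixed-storage setting concrete, here is a minimal sketch of a hashed layer whose real-weight vector stays the same size while the number of virtual output nodes grows. It is an illustration under stated assumptions, not the authors' implementation: `zlib.crc32` is a stand-in hash chosen for portability, the function names are hypothetical, and refinements described elsewhere in the paper are omitted.

```python
import random
import zlib


def bucket(layer, i, j, num_real):
    """Map virtual connection (i, j) of a layer to a real-weight index.

    crc32 is a stand-in hash (an assumption of this sketch).
    """
    key = f"{layer}:{i}:{j}".encode()
    return zlib.crc32(key) % num_real


def hashed_forward(x, real_weights, n_out, layer=0):
    """Pre-activations z_i = sum_j w[h(i, j)] * a_j, with a = x plus a bias input."""
    a = list(x) + [1.0]
    k = len(real_weights)
    return [
        sum(real_weights[bucket(layer, i, j, k)] * a_j for j, a_j in enumerate(a))
        for i in range(n_out)
    ]


rng = random.Random(0)
n_in, base_hidden = 4, 3
# Fixed budget: as many real weights as the unexpanded layer needs.
real = [rng.uniform(-1, 1) for _ in range((n_in + 1) * base_hidden)]
x = [0.5, -1.0, 2.0, 0.0]

for factor in (1, 2, 4):
    z = hashed_forward(x, real, n_out=base_hidden * factor)
    # The layer gets wider, but the number of stored weights never changes.
    print(factor, len(z), len(real))
```

No virtual weight matrix is ever materialized: each virtual weight is recomputed from the hash when needed, so storage is determined by `len(real)` alone.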

With no expansion (at expansion rate 1), different compression 
methods perform differently. At this point edge removal 
is identical to a standard neural network and matches 
its results. If no expansion is performed, the HashNet 
performance suffers from collisions at no benefit. Similarly, 
the low-rank method still randomly projects each layer to a 
random feature space with the same dimensionality. 

For expansion rates greater than 1, all methods improve over 
the fixed-sized neural network. There is a general trend 
that more expansion decreases the test error until a “sweet 
spot”, after which additional expansion tends to hurt. The 
test error of the HashNet neural network decreases 
substantially through the introduction of more “virtual” hidden 
nodes, even though no additional parameters are added. In 
the case of the 5-layer neural network (right) this trend is 
maintained up to an expansion factor of 16x, resulting in 800 
“virtual” nodes. One could hypothetically increase $n^\ell$ 
arbitrarily for HashNet; however, in the limit, too many hash 
collisions would result in increasingly similar gradient 
updates for all weights in $\mathbf{w}$. 
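
The link between collisions and gradient updates follows from the chain rule for tied parameters: the gradient of a real weight is the sum of the gradients of every virtual connection hashed to it. The sketch below illustrates this accumulation for a single layer. It is a simplified illustration, not the paper's code: `zlib.crc32` is a stand-in hash and all names are hypothetical.

```python
import zlib
from collections import Counter


def bucket(layer, i, j, num_real):
    """Stand-in hash (an assumption): virtual connection (i, j) -> real index."""
    key = f"{layer}:{i}:{j}".encode()
    return zlib.crc32(key) % num_real


def shared_weight_gradient(a, delta, num_real, layer=0):
    """Gradient w.r.t. each real weight for z_i = sum_j w[h(i, j)] * a_j.

    a:     input activations (bias already appended)
    delta: dLoss/dz_i for each output node i
    Each virtual connection contributes a_j * delta_i to its bucket.
    """
    grad = [0.0] * num_real
    for i, d in enumerate(delta):
        for j, a_j in enumerate(a):
            grad[bucket(layer, i, j, num_real)] += a_j * d
    return grad


def bucket_loads(n_in, n_out, num_real, layer=0):
    """How many virtual connections share each real weight."""
    return Counter(
        bucket(layer, i, j, num_real) for i in range(n_out) for j in range(n_in)
    )


grad = shared_weight_gradient(a=[1.0, 2.0], delta=[0.5, -1.0, 1.5], num_real=3)
print(grad)
print(bucket_loads(n_in=5, n_out=8, num_real=4))
```

As the number of virtual connections per bucket grows, every real weight accumulates contributions from a larger share of the layer, which is one way to read the remark about updates becoming increasingly similar.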

The benefit from expanding a network cannot continue forever. 
With random edge removal the network will become 
very sparsely connected; the low-rank decomposition approach 
will eventually lead to a decomposition into rank-1 
matrices. HashNet also follows this trend, but is much 
less sensitive as the expansion factor grows. Best results are 
achieved when networks are inflated by a factor of 8 to 16x. 
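
A rough calculation shows how quickly random edge removal becomes sparse under a fixed budget. The sketch assumes a hidden-to-hidden layer with 50 units at expansion factor 1, one bias input per layer, both sides of the layer expanded by the same factor, and a budget equal to the unexpanded layer's weight count. These are illustrative assumptions rather than the exact experimental setup, and the function name is hypothetical.

```python
def edge_removal_stats(factor, base_hidden=50):
    """Kept-edge fraction and mean incoming edges per node at a given expansion."""
    budget = (base_hidden + 1) * base_hidden  # real weights, fixed
    n = base_hidden * factor                  # expanded layer width
    virtual = (n + 1) * n                     # edges of the expanded dense layer
    kept_fraction = budget / virtual
    mean_fan_in = budget / n                  # surviving edges per output node
    return kept_fraction, mean_fan_in


for factor in (1, 2, 4, 8, 16):
    frac, fan_in = edge_removal_stats(factor)
    print(f"x{factor}: keep {frac:.4f} of edges, about {fan_in:.1f} inputs per node")
```

Under these assumptions each node at 16x expansion keeps only about three incoming edges on average, whereas a hashed layer still connects every pair of nodes through a shared weight.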


7. Conclusion 

Prior work shows that weights learned in neural networks 
can be highly redundant (Denil et al., 2013). HashedNets 
exploit this property to create neural networks with “virtual” 
connections that seemingly exceed the storage limits 
of the trained model. This can have surprising effects. Figure 4 
in Section 6 shows that the test error of neural networks 
can drop by nearly 50%, from 3% to 1.61%, through expanding 
the number of weights “virtually” by a factor of 8x. Although 
the collisions (or weight-sharing) might serve as 
a form of regularization, we can probably safely ignore 
this effect as both networks (with and without expansion) 
were also regularized with dropout (Srivastava et al., 2014) 
and the hyper-parameters were carefully fine-tuned through 
Bayesian optimization. 

So why should additional virtual layers help? One answer 
is that they probably truly increase the expressiveness of the 
neural network. As an example, imagine we are provided 
with a neural network with 100 hidden nodes. The internal 
weight matrix has 10000 weights. If we add another set 
of m hidden nodes, this increases the expressiveness of the 
network. If we require all weights of connections to these 
m additional nodes to be “re-used” from the set of existing 
weights, it is not a strong restriction given the large 
number of weights in existence. In addition, the backprop 
algorithm can adjust the shared weights carefully to have 
useful values for all their occurrences. 

As future work we plan to further investigate model 
compression for neural networks. One particular direction of 
interest is to optimize HashedNets for GPUs. GPUs are 
very fast (through parallel processing) but usually feature 
small on-board memory. We plan to investigate how to use 
HashedNets to fit larger networks onto the finite memory 
of GPUs. A specific challenge in this scenario is to avoid 
non-coalesced memory accesses due to the pseudo-random 
hash functions, a sensitive issue for GPU architectures. 